Introduction

The Elite: Dangerous Database (EDDB) is a collection of data for the game Elite: Dangerous. The data is split into several different databases:

  • Prices of goods (listings.csv)
  • Stations (stations.json)
  • Populated Systems (systems_populated.json)
  • Factions (factions,json)
  • Commodities (commodities.json)
  • Modules (modules.json)

This notebook will be an introduction to each data set and provide some exploratory data analysis of the data sets. This EDA will also act as prototypes for the information I will want to display in the Shiny app.

library(tidyverse)
library(jsonlite)
library(plotly)

Prices

The listings.csv file gives a read out of the prices of goods at the time the file was updated.

listings <- read_csv("data/listings.csv", progress = FALSE)
listings

The columns for the listings are:

  • id: The id of the particular good
  • station_id: The station where the good is being sold
  • commodity_id: The good being sold (related to commodities.json)
  • supply: How much of the item is available for purchase
  • supply_bracket: Unsure of purpose
  • buy_price: Price the good can be bought at
  • sell_price: Price the good can be sold at
  • demand: How many items the station wants
  • demand_bracket: Unsure of purpose
  • collected_at: Time the data was collected

We can then look at a summary of the data to get an idea of what is in the data set.

summary(listings)
       id            station_id     commodity_id        supply        
 Min.   :      1   Min.   :    1   Min.   :  1.00   Min.   :       0  
 1st Qu.:1187777   1st Qu.:13470   1st Qu.: 22.00   1st Qu.:       0  
 Median :2338836   Median :27759   Median : 49.00   Median :       0  
 Mean   :2615026   Mean   :29000   Mean   : 87.71   Mean   :   27006  
 3rd Qu.:3698904   3rd Qu.:42522   3rd Qu.:103.00   3rd Qu.:      31  
 Max.   :6900247   Max.   :69881   Max.   :330.00   Max.   :23917110  
                                                                      
 supply_bracket     buy_price          sell_price         demand         
 Min.   :0.00     Min.   :     0.0   Min.   :     0   Min.   :        0  
 1st Qu.:0.00     1st Qu.:     0.0   1st Qu.:   547   1st Qu.:        0  
 Median :0.00     Median :     0.0   Median :  1357   Median :      664  
 Mean   :0.56     Mean   :   535.2   Mean   :  5247   Mean   :   116685  
 3rd Qu.:1.00     3rd Qu.:    63.0   3rd Qu.:  3980   3rd Qu.:    12064  
 Max.   :3.00     Max.   :113355.0   Max.   :257990   Max.   :583601700  
 NA's   :218269                                                          
 demand_bracket    collected_at      
 Min.   :0.00     Min.   :1.454e+09  
 1st Qu.:0.00     1st Qu.:1.523e+09  
 Median :2.00     Median :1.527e+09  
 Mean   :1.81     Mean   :1.523e+09  
 3rd Qu.:3.00     3rd Qu.:1.528e+09  
 Max.   :3.00     Max.   :1.529e+09  
 NA's   :218269                      

From the summary, we can see that there are number of NA’s in the supply_bracket and demand_backet columns. Since we don’t know what those do, we can ignore those for the time being. Additionally, there are zeroes in the sell_price and buy_price columns. Those are essentially NA’s as a sell_price of 0 means that you can’t sell the product there and a buy_price of 0 means there are none available at that station.

listings %>% 
  filter(commodity_id == 1) %>% 
  select(buy_price, sell_price) %>% 
  gather(type, price) %>% 
  filter(price > 0) %>% 
  group_by(type) %>% 
  mutate(average = mean(price), median = median(price)) %>% 
  ggplot(aes(x = price, fill = type)) +
  geom_histogram(bins = 100) +
  facet_grid(type ~.) +
  geom_vline(aes(xintercept = average)) +
  geom_vline(aes(xintercept = median), linetype = "dotted")

This figure gives a quick histogram of the buy and sell prices of a single commodity across all available stations. Note that it filters out any buy or sell prices of 0. The solid line is the mean of the distribution and the dotted line is the median of the distribution. From the figure, an interesting topic to investigate is looking at comparative buy and sell prices across stations to create a system to search for the max difference.

listings %>% 
  filter(station_id == 12) %>% 
  select(buy_price, sell_price) %>% 
  gather(type, price) %>% 
  filter(price > 0) %>% 
  group_by(type) %>% 
  mutate(average = mean(price), median = median(price)) %>% 
  ggplot(aes(x = price, fill = type)) +
  geom_histogram(bins = 20) +
  facet_grid(type ~.) +
  geom_vline(aes(xintercept = average)) +
  geom_vline(aes(xintercept = median), linetype = "dotted")

This figure looks at a histogram of buy and sell prices for a specific station (across all available commodities). Again, the solid line is the mean and the dotted line is the median.

Questions

I want to put together some initial questions to answer using my Shiny app.

  • What is the biggest price difference?
  • What is the distance between the stations with the biggest price difference?
  • Given a particular commodity, where can I find it and at what price?
  • Given a particular commodity, where can I sell it and at what price?
  • How do the buying and selling price compare the the galatic average (and reported average average and median)?

Stations

stations <- fromJSON("data/stations.json")
Error in fromJSON("data/stations.json") : 
  could not find function "fromJSON"

The stations data has 39 columns in it, the most relevant of which are the following:

  • id: The station’s id
  • name: The station’s name
  • system_id: The system id where the station resides
  • updated_at: Time at which the data was updated
  • distance_to_star: Distance of the station from the system’s star
  • has_ : A set of booleans showing what amenities the station offers
  • is_planetary: A boolean showing whether the station is on a planet
  • selling_ships: A list of the ships being sold

There are also a number of other useful variables showing additional information about the stations.

We can summarize the most relevant columns and get a sense of distributions and any possible NAs.

stations %>% 
  select(id, system_id, updated_at, distance_to_star, starts_with("has_"), is_planetary) %>% 
  summary()
       id          system_id          updated_at        distance_to_star 
 Min.   :    1   Min.   :       1   Min.   :1.479e+09   Min.   :      2  
 1st Qu.:18114   1st Qu.:    4843   1st Qu.:1.527e+09   1st Qu.:    200  
 Median :35262   Median :   10419   Median :1.528e+09   Median :    868  
 Mean   :35128   Mean   :   62959   Mean   :1.527e+09   Mean   :  17005  
 3rd Qu.:52285   3rd Qu.:   15469   3rd Qu.:1.529e+09   3rd Qu.:   2670  
 Max.   :69883   Max.   :17805395   Max.   :1.529e+09   Max.   :6783706  
                                                        NA's   :1951     
 has_blackmarket has_market      has_refuel      has_repair     
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:46586     FALSE:14485     FALSE:7157      FALSE:13333    
 TRUE :21261     TRUE :53362     TRUE :60690     TRUE :54514    
                                                                
                                                                
                                                                
                                                                
 has_rearm       has_outfitting  has_shipyard    has_docking    
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:21408     FALSE:23268     FALSE:51848     FALSE:5548     
 TRUE :46439     TRUE :44579     TRUE :15999     TRUE :62299    
                                                                
                                                                
                                                                
                                                                
 has_commodities is_planetary   
 Mode :logical   Mode :logical  
 FALSE:16943     FALSE:40952    
 TRUE :50904     TRUE :26895    
                                
                                
                                
                                

We can also extract a list of possible ships that are sold and from that we can determine what stations sell which ships.

stations %>% 
  select(selling_ships) %>% 
  unnest() %>% 
  distinct()
stations %>% 
  select(id, name, system_id, selling_ships) %>% 
  unnest() %>% 
  filter(selling_ships == "Imperial Cutter") %>% 
  head()

For instance, we can get a listing of all of the stations that sell the “Imperial Cutter” and what system that station is in.

Questions

  • What are the list of purchasable ships?
  • What ship can be bought where?
  • What module can be bought where?

Populated Systems

The populated systems data gives information on systems in the universe that are populated.

populated_systems <- as_tibble(fromJSON("data/systems_populated.json"))
head(populated_systems)

Like the stations data, the populated systems data has a large number of columns (29) that are mostly metadata about the systems. Most of the metadata is centered on government types, alleigance, security level, power play info, and factions. The information that will likely be useful for us are the following columns:

  • id: The system’s id
  • name: the system’s name
  • x, y, z: The coordinates of the system in the universe

First, I want to look at the summary statistics for the coordinates to verify that there are no missing coordinates. It looks like all of the coordinates are present.

populated_systems %>% 
  select(x, y, z) %>% 
  summary()
       x                  y                 z           
 Min.   :-9557.94   Min.   :-944.12   Min.   :-6947.56  
 1st Qu.:  -49.50   1st Qu.: -99.69   1st Qu.:  -47.94  
 Median :   17.23   Median : -25.05   Median :   14.75  
 Mean   :  -20.58   Mean   : -35.68   Mean   :   84.76  
 3rd Qu.:   83.07   3rd Qu.:  34.25   3rd Qu.:   74.38  
 Max.   : 2704.97   Max.   : 366.66   Max.   :19853.19  

I can make a 3D scatter plot of the coordinates of the systems and show the color of the controlling power. All of this done with plotly and could be a fun way of showing where a particular system is in comparison to other systems. In this plot I’ve only included the top 4 controlling powers (based on number of systems controlled). Since I haven’t used plotly much, this could be a great opportunity to become more familiar with it.

p <- populated_systems %>%
  filter(!is.na(power)) %>% 
  group_by(power) %>%
  mutate(control_number = n()) %>%
  ungroup() %>% 
  mutate(control_number = dense_rank(desc(control_number))) %>% 
  filter(control_number < 5) %>%
  plot_ly(x = ~x, y = ~y, z = ~z, color = ~power) %>% 
  add_markers(opacity = 0.1)
p

Questions

  • Where are the various systems?
  • What powers or factions control each system?

Factions

The factions data set gives metadata on all of the factions involved in the game.

factions <- as_tibble(fromJSON("data/factions.json"))
head(factions)

I am unlikely to use this data set for this particular project, but it could provide additional information if I want to look at specific factions, particularly player created factions.

---
title: "Introduction and EDA for EDDB"
author: "Brian Richards"
output: 
  html_notebook:
    toc: true
    toc_float: true
---

## Introduction

The Elite: Dangerous Database (EDDB) is a collection of data for the game 
Elite: Dangerous. The data is split into several different databases: 

* Prices of goods (listings.csv)
* Stations (stations.json)
* Populated Systems (systems_populated.json)
* Factions (factions,json)
* Commodities (commodities.json)
* Modules (modules.json)

This notebook will be an introduction to each data set and provide some 
exploratory data analysis of the data sets. This EDA will also act as 
prototypes for the information I will want to display in the Shiny app.

```{r, message = FALSE}
library(tidyverse)
library(jsonlite)
library(plotly)
```

## Prices

The `listings.csv` file gives a read out of the prices of goods at the time 
the file was updated.

```{r, message = FALSE}
listings <- read_csv("data/listings.csv", progress = FALSE)
```

```{r}
head(listings)
```

The columns for the `listings` are:

* id: The id of the particular good
* station_id: The station where the good is being sold
* commodity_id: The good being sold (related to commodities.json)
* supply: How much of the item is available for purchase
* supply_bracket: Unsure of purpose
* buy_price: Price the good can be bought at 
* sell_price: Price the good can be sold at
* demand: How many items the station wants
* demand_bracket: Unsure of purpose
* collected_at: Time the data was collected

We can then look at a summary of the data to get an idea of what is in the 
data set.

```{r}
summary(listings)
```

From the summary, we can see that there are number of NA's in the `supply_bracket` 
and `demand_backet` columns. Since we don't know what those do, we can ignore 
those for the time being. Additionally, there are zeroes in the `sell_price` 
and `buy_price` columns. Those are essentially NA's as a `sell_price` of 0 means 
that you can't sell the product there and a `buy_price` of 0 means there are 
none available at that station.

```{r}
listings %>% 
  filter(commodity_id == 1) %>% 
  select(buy_price, sell_price) %>% 
  gather(type, price) %>% 
  filter(price > 0) %>% 
  group_by(type) %>% 
  mutate(average = mean(price), median = median(price)) %>% 
  ggplot(aes(x = price, fill = type)) +
  geom_histogram(bins = 100) +
  facet_grid(type ~.) +
  geom_vline(aes(xintercept = average)) +
  geom_vline(aes(xintercept = median), linetype = "dotted")
```

This figure gives a quick histogram of the buy and sell prices of a single 
commodity across all available stations. Note that it filters out any buy or 
sell prices of 0. The solid line is the mean of the distribution and the dotted 
line is the median of the distribution. From the figure, an interesting topic 
to investigate is looking at comparative buy and sell prices across stations 
to create a system to search for the max difference.

```{r}
listings %>% 
  filter(station_id == 12) %>% 
  select(buy_price, sell_price) %>% 
  gather(type, price) %>% 
  filter(price > 0) %>% 
  group_by(type) %>% 
  mutate(average = mean(price), median = median(price)) %>% 
  ggplot(aes(x = price, fill = type)) +
  geom_histogram(bins = 20) +
  facet_grid(type ~.) +
  geom_vline(aes(xintercept = average)) +
  geom_vline(aes(xintercept = median), linetype = "dotted")
```

This figure looks at a histogram of buy and sell prices for a specific station 
(across all available commodities). Again, the solid line is the mean and the 
dotted line is the median.

### Questions 

I want to put together some initial questions to answer using my Shiny app. 

* What is the biggest price difference?
* What is the distance between the stations with the biggest price difference? 
* Given a particular commodity, where can I find it and at what price?
* Given a particular commodity, where can I sell it and at what price?
* How do the buying and selling price compare the the galatic average (and 
reported average average and median)?

## Stations 

```{r}
stations <- as_tibble(fromJSON("data/stations.json"))
```

```{r}
head(stations)
```

The stations data has 39 columns in it, the most relevant of which are the 
following:

* id: The station's id
* name: The station's name
* system_id: The system id where the station resides
* updated_at: Time at which the data was updated
* distance_to_star: Distance of the station from the system's star
* has_ : A set of booleans showing what amenities the station offers
* is_planetary: A boolean showing whether the station is on a planet
* selling_ships: A list of the ships being sold

There are also a number of other useful variables showing additional information 
about the stations.

We can summarize the most relevant columns and get a sense of distributions and 
any possible NAs.

```{r}
stations %>% 
  select(id, system_id, updated_at, distance_to_star, starts_with("has_"), is_planetary) %>% 
  summary()
```

We can also extract a list of possible ships that are sold and from that we 
can determine what stations sell which ships.

```{r}
stations %>% 
  select(selling_ships) %>% 
  unnest() %>% 
  distinct()
```

```{r}
stations %>% 
  select(id, name, system_id, selling_ships) %>% 
  unnest() %>% 
  filter(selling_ships == "Imperial Cutter") %>% 
  head()
```

For instance, we can get a listing of all of the stations that sell the 
"Imperial Cutter" and what system that station is in. 

### Questions 

* What are the list of purchasable ships?
* What ship can be bought where?
* What module can be bought where?

## Populated Systems

The populated systems data gives information on systems in the universe that 
are populated.

```{r}
populated_systems <- as_tibble(fromJSON("data/systems_populated.json"))
```

```{r}
head(populated_systems)
```

Like the stations data, the populated systems data has a large number of 
columns (29) that are mostly metadata about the systems. Most of the metadata 
is centered on government types, alleigance, security level, power play info, 
and factions. The information that will likely be useful for us are the 
following columns: 

* id: The system's id
* name: the system's name
* x, y, z: The coordinates of the system in the universe

First, I want to look at the summary statistics for the coordinates to verify 
that there are no missing coordinates. It looks like all of the coordinates 
are present.

```{r}
populated_systems %>% 
  select(x, y, z) %>% 
  summary()
```

I can make a 3D scatter plot of the coordinates of the systems and show the 
color of the controlling power. All of this done with plotly and could be a fun 
way of showing where a particular system is in comparison to other systems. In 
this plot I've only included the top 4 controlling powers (based on number of 
systems controlled). Since I haven't used plotly much, this could be a great 
opportunity to become more familiar with it.

```{r}
p <- populated_systems %>%
  filter(!is.na(power)) %>% 
  group_by(power) %>%
  mutate(control_number = n()) %>%
  ungroup() %>% 
  mutate(control_number = dense_rank(desc(control_number))) %>% 
  filter(control_number < 5) %>%
  plot_ly(x = ~x, y = ~y, z = ~z, color = ~power) %>% 
  add_markers(opacity = 0.1)

p
```

### Questions 

* Where are the various systems?
* What powers or factions control each system?

## Factions

The factions data set gives metadata on all of the factions involved in the 
game.

```{r}
factions <- as_tibble(fromJSON("data/factions.json"))
```

```{r}
head(factions)
```

I am unlikely to use this data set for this particular project, but it could 
provide additional information if I want to look at specific factions, 
particularly player created factions.

